Digging gold in IIS logfiles

This section describes how to use Microsoft's free tool "Log Parser 2.2" with IIS logfiles.

Have you ever experienced CPU usage skyrocketing on your server, or at least long response times from Sitecore? Then you have likely ended up with a question like this:

     I wonder what pages on my site (be it on the website or in the client) are consuming all the resources?

Well, elsewhere here on SDN5 you can read how global.asax can be used to log all requests taking longer than x milliseconds. However, that doesn't give you the full picture, and you have to wait until enough sample data has been collected. You don't want to wait, you want answers NOW? Well, if the time-taken column is included in your IIS logfiles, you can get those answers. If not, hurry up and add it.
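
You can check this directly in the logfile: the #Fields directive at the top of a W3C-format IIS log lists the columns being logged. The exact selection and order depend on your IIS logging settings, but a header with time-taken enabled looks something like this:

#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status time-taken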

So you open your IIS logfile in Excel and do some sorting, filtering and grouping to show these data, right? Yes, if you have fewer than 65,000 entries and enough patience, maybe you do, but it doesn't exactly work like a dream. If you are (hopefully) familiar with SQL, you are likely to whine something like: "Sigh, why can't I just do some SUMs, COUNTs and AVERAGEs in SQL instead of this Excel nightmare?" Now there is a solution. Try e.g.:

  1. Download Log Parser 2.2 from http://www.microsoft.com/downloads/details.html?FamilyID=890cd06b-abf8-4c25-91b2-f8d975cf8c07&displaylang=en
  2. Install it on your webserver.
  3. From the log folder on your website, run the following query through Log Parser (a full command line is shown right after this list): SELECT TOP 10 COUNT(*), AVG(time-taken), SUM(time-taken), cs-uri-stem INTO report_sum.txt FROM test.log WHERE EXTRACT_EXTENSION(cs-uri-stem)='aspx' GROUP BY cs-uri-stem ORDER BY SUM(time-taken) DESC
  4. Now view the top 10 URLs that spent the most time being rendered in the output file report_sum.txt.
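
For step 3, the SELECT statement is only the query itself; a complete command line could look roughly like this, assuming Log Parser was installed to its default folder and the logfile is named test.log (adjust the path, the -i:IISW3C input switch and the filename to your own setup):

"c:\program files\log parser 2.2\logparser" -i:IISW3C "SELECT TOP 10 COUNT(*), AVG(time-taken), SUM(time-taken), cs-uri-stem INTO report_sum.txt FROM test.log WHERE EXTRACT_EXTENSION(cs-uri-stem)='aspx' GROUP BY cs-uri-stem ORDER BY SUM(time-taken) DESC" -o:NAT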

It works like a charm - and fast. When opening report_sum.txt afterwards, hey presto: a list as promised, showing that Default.html (the frontpage of the website) topped the list with 1,100+ seconds in total. It also tells me that Postlister.html (the mailing-list page) averages a high 3 seconds per request.

COUNT(ALL *) AVG(ALL time-taken) SUM(ALL time-taken) cs-uri-stem
------------ ------------------- ------------------- ----------------
976          1139                1111939             /Default.html
100          2959                295996              /Aktuelt/Postlister.html
146          989                 144425              /Job.html
20           4663                93269               /sitecore/.../Upload/Upload2.html
...

I must admit this is nothing less than a super cool tool that makes it easy to identify problem pages in your installation. The ways these SQL statements can be combined are endless, and Log Parser can even be used with Sitecore 5 logfiles.
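
As one example of another combination (not part of the walkthrough above, just a sketch against the same IIS logfile), this query counts requests per HTTP status code, which quickly tells you whether a load problem is accompanied by errors:

"c:\program files\log parser 2.2\logparser" -i:IISW3C "SELECT sc-status, COUNT(*) AS Hits INTO report_status.txt FROM test.log GROUP BY sc-status ORDER BY Hits DESC" -o:NAT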

Another useful query I would like to show is the following, which gives you per-hour statistics from your logfile.

"c:\program files\log parser 2.2\logparser" "SELECT date, QUANTIZE(time,3600) as interval, COUNT(*), AVG(time-taken), SUM(time-taken), DIV(TO_REAL(COUNT(*)), 3600) AS Requests/sec INTO 'report_total_load_per_minute.txt' FROM %LOGFILE% GROUP BY date, interval ORDER BY date, interval" -o:NAT

As can be seen below, this query will e.g. tell you that there were 111,000+ requests between 12:00 and 13:00, the average time taken was 144 ms, and there was a whopping 31 requests per second during this period. The total time taken is roughly 16,087 seconds, or approximately 4,470 ms of processing per elapsed second. There are only 1,000 ms in a real second, but as these data were taken from a 4-processor server, you multiply by four to get 4,000 available CPU milliseconds per second.

date       interval COUNT(ALL *) AVG(ALL time-taken) SUM(ALL time-taken) Requests/sec
---------- -------- ------------ ------------------- ------------------- ------------
2006-07-03 12:00:00 111391       144                 16086950            30.941944
2006-07-03 13:00:00 133281       106                 14210042            37.022500
2006-07-03 14:00:00 50872        150                 7678055             14.131111
2006-07-03 15:00:00 29795        198                 5921612             8.276389
2006-07-03 16:00:00 10403        435                 4526177             2.889722
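
To make the arithmetic for the 12:00 row explicit (these figures are simply read off the table above):

    16,086,950 ms / 3,600 s   = approx. 4,470 ms of time-taken per elapsed second
    4 CPUs * 1,000 ms         = 4,000 ms of CPU time available per second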

I would like to introduce a key value called "CPU-Utilization", measured as follows:

100% * AVG(time-taken) * (requests per second) / (#CPUs * 1000).

From the above numbers, the CPU-Utilization for the 12:00 interval would be calculated like this: 100% * (144 ms * 31 requests/sec) / (4 * 1000) = 112%.

What does this number tell you? Well, you would not expect to utilize your server above 100%, and a number above that can only mean that requests have been queued or sat idle within the server before a response was delivered. I would say that any CPU-Utilization exceeding 80% is not sound over longer periods of time, and in these cases you should seriously consider whether the workload can be decreased or hardware should be added to the installation. The limit of 80% is taken out of the blue and may well be set too high. At a CPU-Utilization of 112%, something is terribly wrong and your users are probably suffering badly.
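
Since AVG(time-taken) * COUNT(*) equals SUM(time-taken), the same figure can be produced per hour directly by Log Parser. The query below is only a sketch of that idea (not from the original walkthrough); it assumes a 4-CPU server, so the divisor is 3,600 s * 4 CPUs * 1,000 ms = 14,400,000:

"c:\program files\log parser 2.2\logparser" -i:IISW3C "SELECT date, QUANTIZE(time,3600) AS interval, MUL(DIV(TO_REAL(SUM(time-taken)), 14400000.0), 100.0) AS CpuUtilizationPct INTO report_cpu_util.txt FROM %LOGFILE% GROUP BY date, interval ORDER BY date, interval" -o:NAT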

Summary:

This super cool tool allows us to easily determine the historical workload of the server as well as pinpoint URLs that are very resource demanding. Standard .bat files can be created that produce detailed reports from one or more logfiles in seconds. If your customer or IT department insists that their single-CPU server is more than enough hardware to do the job, this may just be the evidence needed to convince them otherwise.
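
As a sketch of such a .bat file (the install path, the report filenames and the -i:IISW3C switch are assumptions here, adjust them to your own setup), the two queries from this article could be wrapped like this, taking the logfile as the first argument:

@echo off
rem usage: report.bat <path-to-iis-logfile>
set LOGFILE=%1

rem Top 10 .aspx URLs by total time-taken
"c:\program files\log parser 2.2\logparser" -i:IISW3C "SELECT TOP 10 COUNT(*), AVG(time-taken), SUM(time-taken), cs-uri-stem INTO report_sum.txt FROM %LOGFILE% WHERE EXTRACT_EXTENSION(cs-uri-stem)='aspx' GROUP BY cs-uri-stem ORDER BY SUM(time-taken) DESC" -o:NAT

rem Requests, average and total time-taken per hour
"c:\program files\log parser 2.2\logparser" -i:IISW3C "SELECT date, QUANTIZE(time,3600) as interval, COUNT(*), AVG(time-taken), SUM(time-taken), DIV(TO_REAL(COUNT(*)), 3600) AS Requests/sec INTO report_total_load_per_hour.txt FROM %LOGFILE% GROUP BY date, interval ORDER BY date, interval" -o:NAT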

